Abstract

Labeling of sentence boundaries is a nec- 
essary prerequisite for many natural lan- 
guage processing tasks, including part-of- 
speech tagging and sentence alignment. 
End-of-sentence punctuation marks are 
ambiguous; to disambiguate them most 
systems use brittle, special-purpose regular 
expression grammars and exception rules. 
As an alternative, we have developed an ef- 
ficient, trainable algorithm that uses a lex- 
icon with part-of-speech probabilities and 
a feed-forward neural network. This work 
demonstrates the feasibility of using prior 
probabilities of part-of-speech assignments, 
as opposed to words or definite part-of- 
speech assignments, as contextual infor- 
mation. After training for less than one 
minute, the method correctly labels over 
98.5% of sentence boundaries in a corpus of 
over 27,000 sentence-boundary marks. We 
show the method to be efficient and easily 
adaptable to different text genres, includ- 
ing single-case texts. 



1 Introduction 



Labeling of sentence boundaries is a necessary
prerequisite for many natural language processing
(NLP) tasks, including part-of-speech tagging
(Church, 1988), (Cutting et al., 1991), and sentence
alignment (Gale and Church, 1993), (Kay and
Roscheisen, 1993). End-of-sentence punctuation
marks are ambiguous; for example, a period can
denote an abbreviation, the end of a sentence, or both,
as shown in the examples below:
(1) The group included Dr. J.M. Freeman and T.
Boone Pickens Jr.

(2) "This issue crosses party lines and crosses
philosophical lines!" said Rep. John Rowland
(R., Conn.).

Riley (1989) determined that in the Tagged Brown
corpus (Francis and Kucera, 1982) about 90% of
periods occur at the end of sentences, 10% at the end
of abbreviations, and about 0.5% as both
abbreviations and sentence delimiters. Note from example
(2) that exclamation points and question marks are
also ambiguous, since they too can appear at
locations other than sentence boundaries.

Most robust NLP systems, e.g., Cutting et al.
(1991), find sentence delimiters by tokenizing the
text stream and applying a regular expression
grammar with some amount of look-ahead, an
abbreviation list, and perhaps a list of exception rules. These
approaches are usually hand-tailored to the
particular text and rely on brittle cues such as capitalization
and the number of spaces following a sentence
delimiter. Typically these approaches use only the tokens
immediately preceding and following the
punctuation mark to be disambiguated. However, more
context can be necessary, such as when an abbreviation
appears at the end of a sentence, as seen in (3a-b):



(3a) It was due Friday by 5 p.m. Saturday would be 
too late. 

(3b) She has an appointment at 5 p.m. Saturday to 
get her car fixed. 

or when punctuation occurs in a subsentence within 
quotation marks or parentheses, as seen in Example 
(2). 
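The limitation of one-token context can be illustrated with a deliberately
naive rule-based splitter. This is only a sketch under stated assumptions: the
abbreviation list, the function name naive_boundaries, and the exact rule are
hypothetical and are not taken from any of the systems cited here.

```python
import re

# Hypothetical abbreviation list; a real system would need a far larger one.
ABBREVIATIONS = {"dr", "jr", "mr", "p.m", "a.m"}


def naive_boundaries(text):
    """Return the offsets of periods that the naive rule labels as boundaries.

    Rule: a period ends a sentence when it is followed by whitespace and a
    capitalized word (or by the end of the text), unless the token directly
    before it is a known abbreviation.
    """
    boundaries = []
    for match in re.finditer(r"\.", text):
        i = match.start()
        before = re.search(r"(\S+)$", text[:i])
        prev_token = before.group(1).lower() if before else ""
        after = text[i + 1:]
        followed_ok = after == "" or re.match(r"\s+[A-Z]", after) is not None
        if followed_ok and prev_token not in ABBREVIATIONS:
            boundaries.append(i)
    return boundaries


s3a = "It was due Friday by 5 p.m. Saturday would be too late."
s3b = "She has an appointment at 5 p.m. Saturday to get her car fixed."
# Both calls report only the final period, so the boundary after "p.m." in
# (3a) is missed while (3b) happens to be labeled correctly.
print(naive_boundaries(s3a), naive_boundaries(s3b))
```

Removing "p.m" from the list flips the outcome: both periods after "p.m" are
then labeled as boundaries, which is right for (3a) and wrong for (3b). The
token before and the token after the period are identical in the two
sentences, so no rule restricted to them can separate the cases.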

Some systems have achieved accurate boundary
determination by applying very large manual effort.
For example, at Mead Data Central, Mark Wasson
and colleagues, over a period of 9 staff months,
developed a system that recognizes special tokens (e.g.,
non-dictionary terms such as proper names, legal
statute citations, etc.) as well as sentence
boundaries. From this, Wasson built a stand-alone
boundary recognizer in the form of a grammar converted
into finite automata with 1419 states and 18002
transitions (excluding the lexicon). The resulting
system, when tested on 20 megabytes of news and 
case law text, achieved an accuracy of 99.7% at 
speeds of 80,000 characters per CPU second on a 
mainframe computer. When tested against upper- 
case legal text the algorithm still performed very 
well, achieving accuracies of 99.71% and 98.24% on 
test data of 5305 and 9396 periods, respectively. It 
is not likely, however, that the results would be this
strong on lower-case data.^1



Humphrey and Zhou (1989) report using a
feed-forward neural network to disambiguate periods,
although they use a regular grammar to tokenize the
text before training the neural nets, and achieve an
accuracy averaging 93%.^2

Riley (1989) describes an approach that uses
regression trees (Breiman et al., 1984) to classify
sentence boundaries according to the following features:

• Probability[word preceding "." occurs at end of
  sentence]

• Probability[word following "." occurs at
  beginning of sentence]

• Length of word preceding "."

• Length of word after "."

• Case of word preceding ".": Upper, Lower,
  Cap, Numbers

• Case of word following ".": Upper, Lower, Cap,
  Numbers

• Punctuation after "." (if any)

• Abbreviation class of words with "."

The method uses information about one word of
context on either side of the punctuation mark and
thus must record, for every word in the lexicon, the
probability that it occurs next to a sentence
boundary. Probabilities were compiled from 25 million
words of pre-labeled training data from a corpus of
AP newswire. The method was tested on the Brown
corpus, achieving an accuracy of 99.8%.^3

^1 All information about Mead's system is courtesy of
a personal communication with Mark Wasson.

^2 Accuracy results were obtained courtesy of a
personal communication with Joe Zhou.

^3 Time for training was not reported, nor was the
amount of the Brown corpus against which testing was
performed; we assume the entire Brown corpus was used.



Müller (1980) provides an exhaustive analysis
of sentence boundary disambiguation as it relates
to lexical endings and the identification of words
surrounding a punctuation mark, focusing on text
written in English. This approach makes multiple
passes through the data and uses large word
lists to determine the positions of full stops.
Accuracy rates of 95-98% are reported for this method
tested on over 75,000 scientific abstracts. (In
contrast to Riley's Brown corpus statistics, Müller
reports sentence-ending to abbreviation ratios ranging
from 92.8%/7.2% to 54.7%/45.3%. This implies a
need for an approach that can adapt flexibly to the
characteristics of different text collections.)

Each of these approaches has disadvantages to 
overcome. We propose that a sentence-boundary 
disambiguation algorithm have the following char- 
acteristics: 

• The approach should be robust, and should 
not require a hand-built grammar or special- 
ized rules that depend on capitalization, mul- 
tiple spaces between sentences, etc. Thus, the 
approach should adapt easily to new text genres 
and new languages. 

• The approach should train quickly on a small 
training set and should not require excessive 
storage overhead. 

• The approach should be very accurate and ef- 
ficient enough that it does not noticeably slow 
down text preprocessing. 

• The approach should be able to specify "no 
opinion" on cases that are too difficult to dis- 
ambiguate, rather than making underinformed 
guesses. 

In the following sections we present an approach
that meets each of these criteria, achieving
performance close to that of solutions that require manually
designed rules, and behaving more robustly. Section
2 describes the algorithm, Section 3 describes some
experiments that evaluate the algorithm, and Section
4 summarizes the paper and describes future
directions.

2 Our Solution 

We have developed an efficient and accurate auto- 
matic sentence boundary labeling algorithm which 
overcomes the limitations of previous solutions. The 
method is easily trainable and adapts to new text 
types without requiring rewriting of recognition 
rules. The core of the algorithm can be stated
concisely as follows: the part-of-speech probabilities of
the tokens surrounding a punctuation mark are used
as input to a feed-forward neural network, and the
network's output activation value determines what
label to assign to the punctuation mark.

The straightforward approach to using contextual 
information is to record for each word the likelihood 
that it appears before or after a sentence bound- 
ary. However, it is expensive to obtain probabilities 
for likelihood of occurrence of all individual tokens 
in the positions surrounding the punctuation mark, 
and most likely such information would not be use- 
ful to any subsequent processing steps in an NLP 
system. Instead, we use probabilities for the part- 
of-speech categories of the surrounding tokens, thus 
making training faster and storage costs negligible 
for a system that must in any case record these prob- 
abilities for use in its part-of-speech tagger. 

This approach appears to incur a cycle: because 
most part-of-speech taggers require pre-determined 
sentence boundaries, sentence labeling must be done 
before tagging. But if sentence labeling is done be- 
fore tagging, no part-of-speech assignments are avail- 
able for the boundary-determination algorithm. In- 
stead of assigning a single part-of-speech to each 
word, our algorithm uses the prior probabilities of 
all parts-of-speech for that word. This is in contrast 
to Riley's method (Riley, 1989), which requires
probabilities to be found for every lexical item (since it
records the number of times every token has been 
seen before and after a period). Instead, we suggest 
making use of the unchanging prior probabilities for 
each word already stored in the system's lexicon. 
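The prior probabilities referred to above can be derived directly from per-tag
frequency counts. The following is a minimal sketch, assuming lexicon entries
in the "TAG/count" form that Section 2.1 describes for the word adult; the
function name category_priors and the four-entry tag-to-category map are
hypothetical stand-ins for the full lexicon mapping.

```python
from collections import defaultdict

# Hypothetical excerpt of a tag-to-category map; the real lexicon has many
# more tags than the four shown here.
TAG_TO_CATEGORY = {"JJ": "adjective", "NN": "noun", "VBN": "verb", "MD": "verb"}


def category_priors(entry):
    """Turn a lexicon entry such as "JJ/2 NN/24" into category probabilities."""
    counts = defaultdict(int)
    for item in entry.split():
        tag, count = item.split("/")
        # Tags mapped to the same category have their frequencies summed.
        counts[TAG_TO_CATEGORY[tag]] += int(count)
    total = sum(counts.values())
    # Dividing by the word's total frequency gives the prior per category.
    return {category: n / total for category, n in counts.items()}


print(category_priors("JJ/2 NN/24"))
```

For the adult entry this yields 2/26 for the adjective category and 24/26 for
the noun category; no sentence-boundary statistics are needed for the word.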

The rest of this section describes the algorithm in 
more detail. 

2.1 Assignment of Descriptors 

The first stage of the process is lexical analysis, 
which breaks the input text (a stream of characters) 
into tokens. Our implementation uses a slightly
modified version of the tokenizer from the PARTS
part-of-speech tagger (Church, 1988) for this task.
A token can be a sequence of alphabetic characters,
a sequence of digits (numbers containing periods
acting as decimal points are considered a single token),
or a single non-alphanumeric character. A lookup
module then uses a lexicon with part-of-speech tags
for each token. This lexicon includes information
about the frequency with which each word occurs as
each possible part-of-speech. The lexicon and the
frequency counts were also taken from the PARTS 
tagger, which derived the counts from the Brown 
corpus (Francis and Kucera, 1982). For the word
adult, for example, the lookup module would return
the tags "JJ/2 NN/24," signifying that the word
occurred 26 times in the Brown corpus - twice as an
adjective and 24 times as a singular noun.

The lexicon contains 77 part-of-speech tags, which
we map into 18 more general categories (see Figure
1). For example, the tags for present tense verb, past
participle, and modal verb all map into the more
general "verb" category. For a given word and
category, the frequency of the category is the sum of the
frequencies of all the tags that are mapped to the
category for that word. The 18 category frequencies
for the word are then converted to probabilities
by dividing the frequencies for each category by the
total number of occurrences of the word.

For each token that appears in the input stream, a
descriptor array is created consisting of the 18
probabilities as well as two additional flags that indicate
if the word begins with a capital letter and if it
follows a punctuation mark.

Figure 1: Elements of the Descriptor Array assigned
to each incoming token.

2.2 The Role of the Neural Network

We accomplish the disambiguation of punctuation
marks using a feed-forward neural network trained
with the back propagation algorithm (Hertz et al.,
1991). The network accepts as input k * 20 input
units, where k is the number of words of context
surrounding an instance of an end-of-sentence
punctuation mark (referred to in this paper as "k-context"),
and 20 is the number of elements in the descriptor
array described in the previous subsection. The
input layer is fully connected to a hidden layer
consisting of j hidden units with a sigmoidal squashing
activation function. The hidden units in turn feed
into one output unit which indicates the results of
the function.^4

^4 The context of a punctuation mark can be thought
of as the sequence of tokens preceding and following it.
Thus this network can be thought of roughly as a
Time-Delay Neural Network (TDNN) (Hertz et al., 1991),
since it accepts a sequence of inputs and is sensitive to
positional information within the sequence. However,
since the input information is not really shifted with
each time step, but rather only presented to the
neural net when a punctuation mark is in the center of the
input stream, this is not technically a TDNN.

The output of the network, a single value between
0 and 1, represents the strength of the evidence that
a punctuation mark occurring in its context is
indeed the end of the sentence. We define two
adjustable sensitivity thresholds t0 and t1, which are
used to classify the results of the disambiguation.
If the output is less than t0, the punctuation mark
is not a sentence boundary; if the output is greater
than or equal to t1, it is a sentence boundary.
Outputs which fall between the thresholds cannot be
disambiguated by the network and are marked
accordingly, so they can be treated specially in later
processing. When t0 = t1, every punctuation mark
is labeled as either a boundary or a non-boundary.
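The forward pass and the threshold rule can be sketched as follows. This is an
illustration under stated assumptions, not the implementation used in the
experiments: the bias terms, the sigmoid on the output unit, the function
names, and the "undecided" label are all assumptions; only the layer structure
and the use of t0 and t1 come from the description above.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def network_output(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass: k * 20 inputs -> j sigmoid hidden units -> one output."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    # Assumption: the output unit also uses a sigmoid, which keeps the
    # result between 0 and 1.
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)


def label(output, t0=0.5, t1=0.5):
    """Apply the two sensitivity thresholds to a network output."""
    if output < t0:
        return "non-boundary"
    if output >= t1:
        return "boundary"
    return "undecided"  # hypothetical name for the "no opinion" case


print(label(0.5), label(0.5, t0=0.2, t1=0.8))
```

With both thresholds at 0.5 the "undecided" branch is unreachable, which
matches the statement that every punctuation mark then receives one of the two
labels.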

To disambiguate a punctuation mark in a
k-context, a window of k+1 tokens and their descriptor
arrays is maintained as the input text is read. The
first k/2 and final k/2 tokens of this sequence
represent the context in which the middle token appears.
If the middle token is a potential end-of-sentence
punctuation mark, the descriptor arrays for the
context tokens are input to the network and the output
result indicates the appropriate label, subject to the
thresholds t0 and t1.
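The windowing step can be sketched as below. The function name context_inputs
and the zero padding used when a punctuation mark sits near the start or end
of the text are assumptions; the text does not say how such edge cases are
handled.

```python
def context_inputs(descriptors, is_candidate, k):
    """Yield (position, input_vector) for each candidate punctuation mark.

    descriptors: one descriptor array (list of floats) per token.
    is_candidate: one bool per token, True for potential end-of-sentence marks.
    k: even number of context tokens (k/2 before and k/2 after the mark).
    """
    half = k // 2
    width = len(descriptors[0]) if descriptors else 0
    padding = [0.0] * width  # assumption: missing context is all zeros
    for i, candidate in enumerate(is_candidate):
        if not candidate:
            continue
        vector = []
        # The middle token (the punctuation mark itself) is skipped.
        for j in list(range(i - half, i)) + list(range(i + 1, i + half + 1)):
            in_range = 0 <= j < len(descriptors)
            vector.extend(descriptors[j] if in_range else padding)
        yield i, vector


# Seven tokens with 20-element descriptors; token 3 is the candidate mark.
tokens = [[float(n)] * 20 for n in range(7)]
flags = [False, False, False, True, False, False, False]
for position, vector in context_inputs(tokens, flags, k=6):
    print(position, len(vector))
```

For a 6-context with 20-element descriptor arrays, each yielded vector has
6 * 20 = 120 elements, one slot per network input unit.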

Section 3 describes experiments which vary the
size of k and the number of hidden units. 

2.3 Heuristics 

A connectionist network can discover patterns in the
input data without using explicit rules, but the
input must be structured to allow the net to recognize
these patterns. Important factors in the effectiveness
of the descriptor arrays include the mapping of
part-of-speech tags into categories, and assignment of
parts-of-speech to words not explicitly contained in the
lexicon.

As previously described, we map the part-of- 
speech tags in the lexicon to more general categories. 
This mapping is, to an extent, dependent on the 
range of tags and on the language being analyzed. 
In our experiments, when all verb forms in English 
are placed in a single category, the results are strong 
(although we did not try alternative mappings). We 
speculate, however, that for languages like German, 
the verb forms will need to be separated from each 
other, as certain forms occur much more frequently 
at the end of a sentence than others do. Similar 
issues may arise in other languages.

Another important consideration is classification 
of words not present in the lexicon, since most texts 
contain infrequent words. Particularly important is 
the ability to recognize tokens that are likely to be 
abbreviations or proper nouns. Müller (1980) gives
an argument for the futility of trying to compile an
exhaustive list of abbreviations in a language, thus
implying the need to recognize unfamiliar
abbreviations. We implement several techniques to
accomplish this. For example, we attempt to identify
initials by assigning an "abbreviation" tag to all
sequences of letters containing internal periods and
no spaces. This finds abbreviations like "J.R." and
"Ph.D." Note that the final period is a punctuation
mark which needs to be disambiguated, and is
therefore not considered part of the word.
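One way to express the internal-period test is a regular expression. This is a
sketch only: the pattern and the function name looks_like_abbreviation are
assumptions, and it presumes the check sees the letter sequence together with
its internal periods as one string, with the final period already split off.

```python
import re

# Letters separated by internal periods, with no spaces. The trailing period
# is a separate punctuation mark and is deliberately not part of the pattern.
INTERNAL_PERIODS = re.compile(r"[A-Za-z]+(?:\.[A-Za-z]+)+")


def looks_like_abbreviation(token):
    return INTERNAL_PERIODS.fullmatch(token) is not None


for candidate in ["J.R", "Ph.D", "Inc", "3.14"]:
    print(candidate, looks_like_abbreviation(candidate))
```

Only "J.R" and "Ph.D" match. A plain word such as "Inc" has no internal period
and would have to be caught by the lexicon or by another technique.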

A capitalized word is not necessarily a proper 
noun, even when it appears somewhere other than in 
a sentence's initial position (e.g., the word "Amer- 
ican" is often used as an adjective). We require 
a way to assign probabilities to capitalized words 
that appear in the lexicon but are not registered as 
proper nouns. We use a simple heuristic: we split 
the word's probabilities, assigning a 0.5 probability
that the word is a proper noun, and dividing the 
remaining 0.5 according to the proportions of the 
probabilities of the parts of speech indicated in the 
lexicon for that word. 

Capitalized words that do not appear in the lex- 
icon at all are generally very likely to be proper 
nouns; therefore, they are assigned a proper noun 
probability of 0.9, with the remaining 0.1 probabil- 
ity distributed equally among all the other parts-of- 
speech. These simple assignment rules are effective 
for English, but would need to be slightly modified 
for other languages with different capitalization rules 
(e.g., in German all nouns are capitalized). 
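The two capitalization heuristics can be written as one small function. This
is a sketch under stated assumptions: the function name, the category names,
and the example probabilities for an "American"-like word are hypothetical,
and the branch that leaves registered proper nouns unchanged is an assumption,
since the text only covers words not registered as proper nouns.

```python
def capitalized_priors(lexicon_priors, categories):
    """Assign category probabilities to a capitalized word.

    lexicon_priors: dict of category -> probability for the word, or None
    if the word is not in the lexicon.
    categories: all category names, including "proper noun".
    """
    others = [c for c in categories if c != "proper noun"]
    if lexicon_priors is None:
        # Unknown capitalized word: 0.9 proper noun, 0.1 shared equally.
        priors = {c: 0.1 / len(others) for c in others}
        priors["proper noun"] = 0.9
        return priors
    if lexicon_priors.get("proper noun", 0.0) > 0.0:
        # Assumption: words already registered as proper nouns are unchanged.
        return dict(lexicon_priors)
    # Known word not registered as a proper noun: 0.5 proper noun, with the
    # other 0.5 divided in proportion to the lexicon's probabilities.
    total = sum(lexicon_priors.values())
    priors = {c: 0.5 * p / total for c, p in lexicon_priors.items()}
    priors["proper noun"] = 0.5
    return priors


CATEGORIES = ["noun", "verb", "adjective", "adverb", "proper noun"]
# Hypothetical priors for a word used mostly as an adjective.
print(capitalized_priors({"adjective": 0.75, "noun": 0.25}, CATEGORIES))
print(capitalized_priors(None, CATEGORIES))
```

In both cases the result still sums to 1, so it can take the place of the
word's lexicon probabilities in the descriptor array.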

3 Experiments and Results 

We tested the boundary labeler on a large body
of text containing 27,294 potential sentence-ending
punctuation marks taken from the Wall Street
Journal portion of the ACL/DCI collection (Church and
Liberman, 1991). No preprocessing was performed
on the test text, aside from removing unnecessary
headers and correcting existing errors. (The
sentence boundaries in the WSJ text had been
previously labeled using a method similar to that used in
PARTS and described in more detail in (Liberman
and Church, 1992); we found and corrected
several hundred errors.) We trained the weights in
the neural network with a back-propagation
algorithm on a training set of 573 items from the same
corpus. To improve generalization, a
separate cross-validation set (containing 258 items
also from the same corpus) was also fed through
the network, but the weights were not trained on
this set. When the cumulative error of the items in
the cross-validation set reached a minimum,
training was stopped. Training was done in batch mode
with a learning rate of 0.08. The entire training
procedure required less than one minute on a Hewlett
Packard 9000/750 Workstation. This should be
contrasted with Riley's algorithm, which required 25
million words of training data in order to compile
probabilities.
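The training procedure described above, batch-mode back-propagation that
stops at the minimum of the cross-validation error, can be sketched in plain
Python. Many details here are assumptions rather than reported facts: the
half-sum-of-squares error, the bias terms, the weight initialization range,
the "patience" rule used to decide that a minimum has been passed, and all
function names. Only the batch updates, the 0.08 learning rate, and the held
out cross-validation set come from the text.

```python
import copy
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(net, x):
    """Return (hidden activations, output) for one input vector."""
    hidden = [sigmoid(sum(w * v for w, v in zip(ws, x)) + b)
              for ws, b in zip(net["W1"], net["b1"])]
    out = sigmoid(sum(w * h for w, h in zip(net["W2"], hidden)) + net["b2"])
    return hidden, out


def half_squared_error(net, data):
    return 0.5 * sum((t - forward(net, x)[1]) ** 2 for x, t in data)


def train(train_set, cross_set, n_hidden=2, rate=0.08,
          max_epochs=2000, patience=20, seed=0):
    rng = random.Random(seed)
    n_in = len(train_set[0][0])
    net = {
        "W1": [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)],
        "b1": [0.0] * n_hidden,
        "W2": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],
        "b2": 0.0,
    }
    best_net = copy.deepcopy(net)
    best_error = half_squared_error(net, cross_set)
    since_best = 0
    for _ in range(max_epochs):
        # Batch mode: accumulate gradients over the whole training set.
        gW1 = [[0.0] * n_in for _ in range(n_hidden)]
        gb1 = [0.0] * n_hidden
        gW2 = [0.0] * n_hidden
        gb2 = 0.0
        for x, t in train_set:
            hidden, out = forward(net, x)
            d_out = (out - t) * out * (1.0 - out)
            gb2 += d_out
            for j, h in enumerate(hidden):
                gW2[j] += d_out * h
                d_hid = d_out * net["W2"][j] * h * (1.0 - h)
                gb1[j] += d_hid
                for i, v in enumerate(x):
                    gW1[j][i] += d_hid * v
        # One weight update per pass through the training set.
        for j in range(n_hidden):
            net["W2"][j] -= rate * gW2[j]
            net["b1"][j] -= rate * gb1[j]
            for i in range(n_in):
                net["W1"][j][i] -= rate * gW1[j][i]
        net["b2"] -= rate * gb2
        # The cross-validation set only monitors progress; no training on it.
        error = half_squared_error(net, cross_set)
        if error < best_error:
            best_net, best_error, since_best = copy.deepcopy(net), error, 0
        else:
            since_best += 1
            if since_best >= patience:
                break
    return best_net, best_error
```

The function returns the weights that gave the lowest cross-validation error
seen so far, so later epochs that start to overfit the training items do not
affect the result.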

If we use Riley's statistics presented in Section
1, we can determine a lower bound for the accuracy
of a sentence boundary disambiguation algorithm: an
algorithm that always labels a period as a sentence
boundary would be correct 90% of the time; therefore, any
method must perform better than 90%. In our
experiments, performance was very strong: with both
sensitivity thresholds set to 0.5, the network method
was successful in disambiguating 98.5% of the
punctuation marks, mislabeling only 409 of 27,294. These
errors fall into two major categories: (i) "false
positive": the method erroneously labeled a punctuation
mark as a sentence boundary, and (ii) "false
negative": the method did not label a sentence boundary
as such. See Table 1 for details.



224 (54.8%)   false positives
185 (45.2%)   false negatives

409           total errors out of 27,294 items

Table 1: Results of testing on 27,294 mixed-case
items; t0 = t1 = 0.5, 6-context, 2 hidden units.
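The figures in Table 1 can be checked with a few lines of arithmetic. The
counts are taken from the table; the variable names are illustrative.

```python
TOTAL_ITEMS = 27294
FALSE_POSITIVES = 224
FALSE_NEGATIVES = 185

errors = FALSE_POSITIVES + FALSE_NEGATIVES
error_rate = errors / TOTAL_ITEMS
accuracy = 1.0 - error_rate

print(f"errors: {errors}")                                            # errors: 409
print(f"error rate: {error_rate:.2%}")                                # error rate: 1.50%
print(f"accuracy: {accuracy:.2%}")                                    # accuracy: 98.50%
print(f"false positive share: {FALSE_POSITIVES / errors:.1%}")        # 54.8%
print(f"false negative share: {FALSE_NEGATIVES / errors:.1%}")        # 45.2%
```

The resulting accuracy is well above the 90% that the always-a-boundary
baseline would reach on periods under Riley's Brown corpus statistics.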

The 409 errors from this testing run can be
decomposed into the following groups:

37.6% false positive at an abbreviation within a
      title or name, usually because the word
      following the period exists in the lexicon
      with other parts-of-speech (Mr. Gray, Col.
      North, Mr. Major, Dr. Carpenter, Mr.
      Sharp). Also included in this group are
      items such as U.S. Supreme Court or U.S.
      Army, which are sometimes mislabeled
      because U.S. occurs very frequently at the
      end of a sentence as well.

22.5% false negative due to an abbreviation at
      the end of a sentence, most frequently Inc.,
      Co., Corp., or U.S., which all occur within
      sentences as well.

11.0% false positive or negative due to a sequence
      of characters including a punctuation mark
      and quotation marks, as this sequence can
      occur both within and at the end of
      sentences.

 9.2% false negative resulting from an
      abbreviation followed by quotation marks; related
      to the previous two types.

 9.8% false positive or false negative resulting
      from presence of ellipsis (...), which can
      occur at the end of or within a sentence.

 9.9% miscellaneous errors, including extraneous
      characters (dashes, asterisks, etc.),
      ungrammatical sentences, misspellings, and
      parenthetical sentences.

The results presented above (409 errors) are
obtained when both t0 and t1 are set at 0.5.
Adjusting the sensitivity thresholds decreases the number
of punctuation marks which are mislabeled by the
method. For example, when the upper threshold is
set at 0.8 and the lower threshold at 0.2, the network
places 164 items between the two. Thus when the
algorithm does not have enough evidence to classify
the items, some mislabeling can be avoided.^5

We also experimented with different context sizes 
and numbers of hidden units, obtaining the results 
shown in Tables 2 and 3. All results were found using
the same training set of 573 items, cross-validation 
set of 258 items, and mixed-case test set of 27,294 
items. The "Training Error" is one-half the sum of 
all the errors for all 573 items in the training set, 
where the "error" is the difference between the de- 
sired output and the actual output of the neural net. 
The "Cross Error" is the equivalent value for the 
cross-validation set. These two error figures give an 
indication of how well the network learned the train- 
ing data before stopping. 

We observed that a net with fewer hidden units 
results in a drastic decrease in the number of false 
positives and a corresponding increase in the number 
of false negatives. Conversely, increasing the number 
of hidden units results in a decrease of false negatives 
(to zero) and an increase in false positives. A net- 
work with 2 hidden units produces the best overall 
error rate, with false negatives and false positives 
nearly equal. 

From these data we concluded that a context of 
six surrounding tokens and a hidden layer with two 
units worked best for our test set. 

After converting the training, cross-validation and
test texts to a lower-case-only format and retraining,
the network was able to successfully disambiguate
96.2% of the boundaries in a lower-case-only test
text. Repeating the procedure with an upper-case-only
format produced a 97.4% success rate. Unlike
most existing methods which rely heavily on
capitalization information, the network method is
reasonably successful at disambiguating single-case texts.

^5 We will report on results of varying the thresholds
in future work.

Context     Training   Training   Cross    Testing   Testing
Size        Epochs     Error      Error    Errors    Error (%)

4-context   1731       1.52       2.36     1424      5.22%
6-context    218       0.75       2.01      409      1.50%
8-context    831       0.043      1.88      877      3.21%

Table 2: Results of comparing context sizes (2 hidden units).

# Hidden    Training   Training   Cross    Testing   Testing
Units       Epochs     Error      Error    Errors    Error (%)

1            623       1.05       1.61      721      2.64%
2            216       1.08       2.18      409      1.50%
3            239       0.39       2.27      435      1.59%
4            350       0.27       1.42     1343      4.92%

Table 3: Results of comparing hidden layer sizes (6-context).
Training was done on 573 items, using a cross-validation
set of 258 items.

4 Discussion and Future Work 

We have presented an automatic sentence boundary 
labeler which uses probabilistic part-of-speech infor- 
mation and a simple neural network to correctly dis- 
ambiguate over 98.5% of sentence-boundary punctu- 
ation marks. A novel aspect of the approach is its 
use of prior part-of-speech probabilities, rather than 
word tokens, to represent the context surrounding 
the punctuation mark to be disambiguated. This 
leads to savings in parameter estimation and thus 
training time. The stochastic nature of the input, 
combined with the inherent robustness of the con- 
nectionist network, produces robust results. The al- 
gorithm is to be used in conjunction with a part- 
of-speech tagger, and so assumes the availability of 
a lexicon containing prior probabilities of parts-of- 
speech. The network is rapidly trainable and thus 
should be easily adaptable to new text genres, and is 
very efficient when used in its labeling capacity.
Although the systems of Wasson and Riley (1989)
report slightly better error rates, our approach has the
advantage of flexibility for application to new text
genres, small training sets (and hence fast training
times), (relatively) small storage requirements, and
little manual effort. Furthermore, additional
experimentation may lower the error rate.

Although our results were obtained using an
English lexicon and text, we designed the boundary
labeler to be equally applicable to other languages,
assuming the accessibility of lexical part-of-speech
frequency data (which can be obtained by running
a part-of-speech tagger over a large corpus of text,
if it is not available in the tagger itself) and an
abbreviation list. The input to the neural network is
a language-independent set of descriptor arrays, so
training and labeling would not require recoding for
a new language. The heuristics described in Section
2.3 may need to be adjusted for other languages in
order to maximize the efficacy of these descriptor
arrays.

Many variations remain to be tested. We plan to: 
(i) test the approach on French and perhaps Ger- 
man, (ii) perform systematic studies on the effects 
of asymmetric context sizes, different part-of-speech 
categorizations, different thresholds, and larger de- 
scriptor arrays, (iii) apply the approach to texts with 
unusual or very loosely constrained markup formats, 
and perhaps even to other markup recognition prob- 
lems, and (iv) compare the use of the neural net with 
more conventional tools such as decision trees and 
Hidden Markov Models. 